OMEGA: Can LLMs Reason Outside the Box in Math? Evaluating Exploratory, Compositional, and Transformative Generalization

Sun, Yiyou, Hu, Shawn, Zhou, Georgia, Zheng, Ken, Hajishirzi, Hannaneh, Dziri, Nouha, Song, Dawn

arXiv.org Artificial Intelligence

Recent large-scale language models (LLMs) with long Chain-of-Thought reasoning, such as DeepSeek-R1, have achieved impressive results on Olympiad-level mathematics benchmarks. However, they often rely on a narrow set of strategies and struggle with problems that require a novel way of thinking. To systematically investigate these limitations, we introduce OMEGA (Out-of-distribution Math Problems Evaluation with 3 Generalization Axes), a controlled yet diverse benchmark designed to evaluate three axes of out-of-distribution generalization, inspired by Boden's typology of creativity: (1) Exploratory: applying known problem-solving skills to more complex instances within the same problem domain; (2) Compositional: combining distinct reasoning skills, previously learned in isolation, to solve novel problems that require integrating these skills in new and coherent ways; and (3) Transformative: adopting novel, often unconventional strategies by moving beyond familiar approaches to solve problems more effectively. OMEGA consists of programmatically generated training-test pairs derived from templated problem generators across geometry, number theory, algebra, combinatorics, logic, and puzzles, with solutions verified using symbolic, numerical, or graphical methods. We evaluate frontier (or top-tier) LLMs and observe sharp performance degradation as problem complexity increases. Moreover, we fine-tune the Qwen-series models across all generalization settings and observe notable improvements in exploratory generalization, while compositional generalization remains limited and transformative reasoning shows little to no improvement. By isolating and quantifying these fine-grained failures, OMEGA lays the groundwork for advancing LLMs toward genuine mathematical creativity beyond mechanical proficiency.
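To make the "templated problem generators with verified solutions" idea concrete, here is a minimal sketch of a generator for one simple algebra family with a symbolic verifier and a difficulty knob. The template, difficulty scaling, and function names are illustrative assumptions, not taken from the OMEGA release.

```python
# Illustrative sketch (not the OMEGA code): a templated generator for a simple
# algebra family, with a symbolic verifier and a difficulty knob that controls
# how large the generated coefficients are.
import random
import sympy as sp

def generate_linear_problem(difficulty, seed=None):
    """Generate 'solve a*x + b = c for x' with coefficient size tied to difficulty."""
    rng = random.Random(seed)
    bound = 10 ** difficulty                  # larger difficulty -> larger coefficients
    a = rng.randint(1, bound)
    b = rng.randint(-bound, bound)
    c = rng.randint(-bound, bound)
    question = f"Solve for x: {a}*x + {b} = {c}"
    answer = sp.Rational(c - b, a)            # ground-truth solution kept symbolically
    return question, answer

def verify(candidate, answer):
    """Symbolically check a model's candidate answer against the ground truth."""
    try:
        return sp.simplify(sp.sympify(candidate) - answer) == 0
    except (sp.SympifyError, TypeError):
        return False

if __name__ == "__main__":
    q, ans = generate_linear_problem(difficulty=2, seed=0)
    print(q)
    print(verify(str(ans), ans))   # True: the reference answer verifies against itself
```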


Complex LLM Planning via Automated Heuristics Discovery

Ling, Hongyi, Parashar, Shubham, Khurana, Sambhav, Olson, Blake, Basu, Anwesha, Sinha, Gaurangi, Tu, Zhengzhong, Caverlee, James, Ji, Shuiwang

arXiv.org Artificial Intelligence

We consider enhancing large language models (LLMs) for complex planning tasks. While existing methods allow LLMs to explore intermediate steps to make plans, they either depend on unreliable self-verification or external verifiers to evaluate these steps, which demand significant data and computations. Here, we propose automated heuristics discovery (AutoHD), a novel approach that enables LLMs to explicitly generate heuristic functions to guide inference-time search, allowing accurate evaluation of intermediate states. These heuristic functions are further refined through a heuristic evolution process, improving their robustness and effectiveness. Our proposed method requires no additional model training or fine-tuning, and the explicit definition of heuristic functions generated by the LLMs provides interpretability and insights into the reasoning process. Extensive experiments across diverse benchmarks demonstrate significant gains over multiple baselines, including nearly twice the accuracy on some datasets, establishing our approach as a reliable and interpretable solution for complex planning tasks.
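A compressed sketch of the core loop AutoHD describes: a heuristic function, proposed by the LLM as code, scores intermediate states during inference-time search. The toy planning domain, the `propose_heuristic` stub, and the search routine below are placeholders for illustration, not the paper's prompts or benchmarks.

```python
# Illustrative sketch of heuristic-guided search with an LLM-proposed heuristic.
# `propose_heuristic` stands in for the LLM call; here it returns a hand-written
# heuristic so the example is runnable without a model.
import heapq

def propose_heuristic():
    """Placeholder for the LLM step that emits a heuristic function as code."""
    def h(state, goal):
        # toy heuristic for a number-reaching puzzle: distance to the goal value
        return abs(goal - state)
    return h

def successors(state):
    """Toy planning domain: from n you may move to n + 1 or n * 2."""
    return [state + 1, state * 2]

def best_first_search(start, goal, heuristic, max_expansions=10_000):
    frontier = [(heuristic(start, goal), start, [start])]
    seen = {start}
    for _ in range(max_expansions):
        if not frontier:
            break
        _, state, path = heapq.heappop(frontier)
        if state == goal:
            return path
        for nxt in successors(state):
            if nxt not in seen and nxt <= goal * 2:
                seen.add(nxt)
                heapq.heappush(frontier, (heuristic(nxt, goal), nxt, path + [nxt]))
    return None

if __name__ == "__main__":
    h = propose_heuristic()
    print(best_first_search(start=1, goal=37, heuristic=h))
```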


SBSC: Step-By-Step Coding for Improving Mathematical Olympiad Performance

Singh, Kunal, Biswas, Ankan, Bhowmick, Sayandeep, Moturi, Pradeep, Gollapalli, Siva Kishore

arXiv.org Artificial Intelligence

We propose Step-by-Step Coding (SBSC): a multi-turn math reasoning framework that enables Large Language Models (LLMs) to generate a sequence of programs for solving Olympiad-level math problems. At each step/turn, by leveraging the code execution outputs and programs of previous steps, the model generates the next sub-task and the corresponding program to solve it. In this way, SBSC sequentially navigates to the final answer. SBSC allows a more granular, flexible, and precise approach to problem-solving compared to existing methods. Extensive experiments highlight the effectiveness of SBSC in tackling competition- and Olympiad-level math problems. For Claude-3.5-Sonnet, we observe that SBSC (greedy decoding) surpasses existing state-of-the-art (SOTA) program-generation-based reasoning strategies by an absolute 10.7% on AMC12, 8% on AIME, and 12.6% on MathOdyssey. Given that SBSC is multi-turn in nature, we also benchmark SBSC's greedy decoding against self-consistency decoding results of existing SOTA math reasoning strategies and observe performance gains of an absolute 6.2% on AMC, 6.7% on AIME, and 7.4% on MathOdyssey.
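A minimal sketch of the multi-turn loop SBSC describes: at each turn a model proposes the next sub-task and a program, the program is executed, and the output is appended to the transcript for the next turn. The `ask_model` stub, termination flag, and toy sub-tasks are assumptions for illustration, not the paper's actual prompts.

```python
# Illustrative sketch of a step-by-step coding loop. `ask_model` is a stub for
# the LLM call; it would normally return the next sub-task and a code snippet
# conditioned on the transcript so far.
import io
import contextlib

def ask_model(transcript):
    """Placeholder LLM: returns (sub_task, code, done). Replace with a real model call."""
    if "step 1" not in transcript:
        return "step 1: compute 2**10", "print(2**10)", False
    return "final: report the answer", "print('ANSWER: 1024')", True

def run_snippet(code):
    """Execute a generated snippet and capture its stdout."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})          # in practice this should run in a sandbox
    return buf.getvalue().strip()

def solve(max_turns=8):
    transcript = ""
    for _ in range(max_turns):
        sub_task, code, done = ask_model(transcript)
        output = run_snippet(code)
        transcript += f"{sub_task}\n{code}\n# output: {output}\n"
        if done:
            return output, transcript
    return None, transcript

if __name__ == "__main__":
    answer, log = solve()
    print(answer)
```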


BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning

Zhang, Beichen, Liu, Yuhong, Dong, Xiaoyi, Zang, Yuhang, Zhang, Pan, Duan, Haodong, Cao, Yuhang, Lin, Dahua, Wang, Jiaqi

arXiv.org Artificial Intelligence

Cutting-edge large language models (LLMs) demonstrate promising performance in solving complex math problems with a divide-and-conquer pipeline and the assistance of in-context learning (ICL) examples. However, their potential for improvement is limited by two critical problems within their ICL examples: granularity mismatch and the ensuing negative-effect noise. Specifically, LLMs handle the dividing process well yet often fail due to inaccurate reasoning within a few conquer steps, while ICL examples retrieved at question granularity sometimes lack the relevant steps for a specific challenging reasoning step; this mismatch can introduce irrelevant context that hinders correct reasoning. To this end, we focus on improving the reasoning quality within each step and present BoostStep. BoostStep aligns retrieval and reasoning at step granularity, and provides highly related ICL examples for each reasoning step with a novel "first-try" strategy. BoostStep provides more relevant examples than the coarse question-grained strategy, steadily enhancing the model's reasoning quality within each step. BoostStep is a general and robust reasoning-enhancing method that not only improves standalone reasoning performance but also integrates seamlessly with Monte Carlo Tree Search (MCTS) to refine both candidate generation and decision-making. Quantitatively, it improves GPT-4o and Qwen2.5-Math-72B by 3.6% and 2.0% respectively on various mathematical benchmarks, with a 7.5% gain when combined with MCTS.
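A rough sketch of the step-grained retrieval idea: a draft ("first-try") step is used as the query to fetch the most similar step-level example, which then guides regeneration of that step. The token-overlap similarity, example bank, and model stub below are simplified stand-ins, not BoostStep's actual retriever or prompts.

```python
# Illustrative sketch of step-grained example retrieval with a "first-try" query.
# The example bank, similarity function, and model stub are all simplified.

STEP_BANK = [
    "Apply the quadratic formula x = (-b +/- sqrt(b^2 - 4ac)) / (2a).",
    "Factor the difference of squares a^2 - b^2 = (a - b)(a + b).",
    "Use the triangle inequality to bound the sum of side lengths.",
]

def token_overlap(a, b):
    """Crude similarity: Jaccard overlap of lowercase tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))

def retrieve_step_example(first_try):
    """Pick the bank step most similar to the model's first-try draft."""
    return max(STEP_BANK, key=lambda ex: token_overlap(first_try, ex))

def generate_step(context, example=None):
    """Placeholder for LLM step generation; a real system would call a model."""
    hint = f" (guided by: {example})" if example else ""
    return f"next step for '{context}'{hint}"

def boosted_step(context):
    draft = generate_step(context)                 # first try, no example
    example = retrieve_step_example(draft)         # step-grained retrieval
    return generate_step(context, example=example) # regenerate with the example

if __name__ == "__main__":
    print(boosted_step("solve x^2 - 5x + 6 = 0"))
```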


Give me a hint: Can LLMs take a hint to solve math problems?

Agrawal, Vansh, Singla, Pratham, Miglani, Amitoj Singh, Garg, Shivank, Mangal, Ayush

arXiv.org Artificial Intelligence

While state-of-the-art LLMs have shown poor logical and basic mathematical reasoning, recent works try to improve their problem-solving abilities using prompting techniques. We propose giving "hints" to improve the language model's performance on advanced mathematical problems, taking inspiration from how humans approach math pedagogically. We also test robustness to adversarial hints and demonstrate the models' sensitivity to them. We demonstrate the effectiveness of our approach by evaluating diverse LLMs, presenting them with a broad set of problems of different difficulties and topics from the MATH dataset and comparing against techniques such as one-shot, few-shot, and chain-of-thought prompting.
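A small sketch of the hint-augmented prompting setup the abstract describes, contrasted with a plain zero-shot prompt; the template wording is an assumption for illustration, not the paper's exact prompts.

```python
# Illustrative prompt construction for hint-based evaluation. The templates are
# hypothetical; a real experiment would send these strings to an LLM API.

def plain_prompt(problem):
    return f"Solve the following problem and give the final answer.\n\nProblem: {problem}"

def hinted_prompt(problem, hint):
    return (
        "Solve the following problem. A hint is provided; use it if helpful.\n\n"
        f"Problem: {problem}\nHint: {hint}\n"
    )

if __name__ == "__main__":
    problem = "Find the remainder when 2^10 is divided by 7."
    print(plain_prompt(problem))
    print(hinted_prompt(problem, "The powers of 2 modulo 7 repeat with period 3."))
```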


Let's Be Self-generated via Step by Step: A Curriculum Learning Approach to Automated Reasoning with Large Language Models

Luo, Kangyang, Ding, Zichen, Weng, Zhenmin, Qiao, Lingfeng, Zhao, Meng, Li, Xiang, Yin, Di, Shu, Jinlong

arXiv.org Artificial Intelligence

Prior to our efforts, there has already been work striving toward this goal. For example, Self-ICL (Chen et al., 2023) begins by prompting the LLM to generate a few new, diverse, and creative proxy queries tailored to the target task, and then solves each of them independently in a zero-shot chain-of-thought (ZS-CoT) manner, which in turn yields proxy exemplars for prompting the LLM to engage in reasoning. Auto-ICL (Yang et al., 2023) operates similarly to Self-ICL, but differs in that it instructs the LLM to produce proxy queries with the same structure as the given query. Analogical Prompting (Yasunaga et al., 2023) draws on the cognitive process of solving new problems from relevant past experiences, i.e., analogical reasoning, and prompts the language model to self-generate relevant examples in context before embarking on the solution of a given query. Notably, the one-pass generation mode employed in Analogical Prompting necessitates that the LLM possess robust capabilities for both following instructions and generating responses. We revisit the aforementioned approaches and discern that their efficacy hinges on guiding the LLM to recall experiences relevant to the given query. However, relying solely on such experiences may lead to proxy queries that are as challenging as the given query, along with corresponding erroneous proxy solutions, potentially misleading the solution of the original query.


NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts

Zhang, Shudan, Zhao, Hanlin, Liu, Xiao, Zheng, Qinkai, Qi, Zehan, Gu, Xiaotao, Zhang, Xiaohan, Dong, Yuxiao, Tang, Jie

arXiv.org Artificial Intelligence

Large language models (LLMs) have manifested a strong ability to generate code for productive activities. However, current benchmarks for code synthesis, such as HumanEval, MBPP, and DS-1000, are predominantly oriented towards introductory tasks in algorithms and data science, insufficiently covering the challenging requirements prevalent in real-world coding. To fill this gap, we propose NaturalCodeBench (NCB), a challenging code benchmark designed to mirror the complexity and variety of scenarios in real coding tasks. NCB comprises 402 high-quality problems in Python and Java, meticulously selected from natural user queries on online coding services and covering 6 different domains. Noting the extraordinary difficulty of creating test cases for real-world queries, we also introduce a semi-automated pipeline to enhance the efficiency of test case construction. Compared with manual construction, it achieves an efficiency increase of more than 4 times. Our systematic experiments on 39 LLMs find that performance gaps on NCB between models with close HumanEval scores can still be significant, indicating a lack of focus on practical code synthesis scenarios or over-specified optimization on HumanEval. On the other hand, even the best-performing GPT-4 is still far from satisfactory on NCB. The evaluation toolkit and development set are available at https://github.com/THUDM/NaturalCodeBench.


Tutorial and Practice in Linear Programming: Optimization Problems in Supply Chain and Transport Logistics

Bridgelall, Raj

arXiv.org Artificial Intelligence

This tutorial is an andragogical guide for students and practitioners seeking to understand the fundamentals and practice of linear programming. The exercises demonstrate how to solve classical optimization problems, with an emphasis on spatial analysis in supply chain management and transport logistics. All exercises display the Python programs and optimization libraries used to solve them. The first chapter introduces key concepts in linear programming and contributes a new cognitive framework to help students and practitioners set up each optimization problem. The cognitive framework organizes the decision variables, constraints, the objective function, and variable bounds in a format for direct application to optimization software. The second chapter introduces two types of mobility optimization problems (shortest path in a network and minimum cost tour) in the context of delivery and service planning logistics. The third chapter introduces four types of spatial optimization problems (neighborhood coverage, flow capturing, zone heterogeneity, service coverage) and contributes a workflow to visualize the optimized solutions in maps. The workflow creates decision variables from maps using the free geographic information systems (GIS) programs QGIS and GeoDA. The fourth chapter introduces three types of spatial logistical problems (spatial distribution, flow maximization, warehouse location optimization) and demonstrates how to scale the cognitive framework in software to reach solutions. The final chapter summarizes lessons learned and provides insights into how students and practitioners can modify the Python programs and GIS workflows to solve their own optimization problems and visualize the results.
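To make the setup concrete, here is a small linear program in the spirit of the tutorial's supply-chain examples, solved with scipy.optimize.linprog; the data and variable names are invented for illustration and are not taken from the tutorial's exercises.

```python
# Illustrative transportation LP (invented data): two warehouses supply two
# stores at different unit shipping costs; minimize total shipping cost.
from scipy.optimize import linprog

# Decision variables: x = [x11, x12, x21, x22], units shipped warehouse i -> store j.
cost = [2, 3, 4, 1]                       # objective coefficients (unit costs)

A_ub = [[1, 1, 0, 0],                     # warehouse 1 ships at most 20 units
        [0, 0, 1, 1]]                     # warehouse 2 ships at most 30 units
b_ub = [20, 30]

A_eq = [[1, 0, 1, 0],                     # store 1 demands exactly 25 units
        [0, 1, 0, 1]]                     # store 2 demands exactly 25 units
b_eq = [25, 25]

res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * 4, method="highs")
print(res.x)        # optimal shipment plan
print(res.fun)      # minimum total shipping cost (85.0 with this data)
```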


DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation

Lai, Yuhang, Li, Chengxi, Wang, Yiming, Zhang, Tianyi, Zhong, Ruiqi, Zettlemoyer, Luke, Yih, Scott Wen-tau, Fried, Daniel, Wang, Sida, Yu, Tao

arXiv.org Artificial Intelligence

We introduce DS-1000, a code generation benchmark with a thousand data science problems spanning seven Python libraries, such as NumPy and Pandas. Compared to prior work, DS-1000 incorporates three core features. First, our problems reflect diverse, realistic, and practical use cases, since we collected them from StackOverflow. Second, our automatic evaluation is highly specific (reliable): across all Codex-002-predicted solutions that our evaluation accepts, only 1.8% are incorrect; we achieve this with multi-criteria metrics, checking both functional correctness by running test cases and surface-form constraints by restricting API usages or keywords. Finally, we proactively defend against memorization by slightly modifying our problems so they differ from the original StackOverflow source; consequently, models cannot answer them correctly by memorizing solutions from pre-training. The current best public system (Codex-002) achieves 43.3% accuracy, leaving ample room for improvement. We release our benchmark at https://ds1000-code-gen.github.io.
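A rough sketch of the two-part check described above, pairing a functional test with a surface-form constraint on which APIs the solution may use; the problem, test, and keyword lists are made up and far simpler than DS-1000's actual evaluation harness.

```python
# Illustrative two-criteria check: (1) run a functional test on the candidate,
# (2) enforce a surface-form constraint on the code text (required/banned tokens).
import numpy as np

def functional_check(candidate_code, test_input, expected):
    """Execute the candidate and compare its result against the reference output."""
    namespace = {"np": np}
    exec(candidate_code, namespace)            # defines `solution` in the namespace
    return np.array_equal(namespace["solution"](test_input), expected)

def surface_form_check(candidate_code, required=("np.cumsum",), banned=("for ",)):
    """Require certain API calls and forbid others (here: no explicit Python loop)."""
    return all(tok in candidate_code for tok in required) and \
           not any(tok in candidate_code for tok in banned)

candidate = """
def solution(a):
    return np.cumsum(a)
"""

if __name__ == "__main__":
    ok_func = functional_check(candidate, np.array([1, 2, 3]), np.array([1, 3, 6]))
    ok_form = surface_form_check(candidate)
    print(ok_func and ok_form)    # True only if both criteria pass
```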


A Tour of Machine Learning Algorithms

#artificialintelligence

In this post, we will take a tour of the most popular machine learning algorithms. It is useful to tour the main algorithms in the field to get a feeling for what methods are available. There are so many algorithms that it can feel overwhelming when algorithm names are thrown around and you are expected to just know what they are and where they fit. I want to give you two ways to think about and categorize the algorithms you may come across in the field. Both approaches are useful, but we will focus on the grouping of algorithms by similarity and go on a tour of a variety of different algorithm types.